I've seen a couple of nice kernels here, but no one has explained the importance of morphological pre-processing of the data. So I decided to compare two approaches to morphological normalization: stemming and lemmatization. Both reduce a word to a regularized form, but stemming reduces the word to its stem, while lemmatization reduces the word to its morphological root (lemma) with the help of a dictionary lookup.

I evaluate the efficiency of these approaches by comparing their performance with the naive Bag of Means method: every word is encoded as a word embedding vector, and the joint vector of two messages is computed as the mean of these vectors. Several studies have shown that such an approach can be a very strong baseline (Faruqui et al., 2014; Yu et al., 2014; Gershman and Tenenbaum, 2015; Kenter and de Rijke, 2015). I then use the obtained vectors as feature vectors to train the classifiers.
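The Bag of Means idea can be sketched in a few lines (with tiny hypothetical 3-dimensional embeddings standing in for the real Word2Vec vectors trained below):

```python
import numpy as np

# Hypothetical toy embeddings; the real ones are the 100-d Word2Vec vectors below.
embeddings = {
    'how': np.array([0.1, 0.0, 0.2]),
    'are': np.array([0.0, 0.3, 0.1]),
    'you': np.array([0.4, 0.1, 0.0]),
}

def bag_of_means(tokens_a, tokens_b, embeddings):
    """Encode a message pair as the mean of all its word vectors."""
    vectors = [embeddings[w] for w in tokens_a + tokens_b if w in embeddings]
    return np.mean(vectors, axis=0)

pair_vector = bag_of_means(['how', 'are'], ['you'], embeddings)
```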

I will also compare them with the default approach (no morphological pre-processing).

Okay, let's load NLTK and implement these two approaches with the Lancaster Stemmer (one of the most popular stemming algorithms) and the WordNet Lemmatizer (based on WordNet's built-in morphy function):


In [1]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
lemmer = WordNetLemmatizer()

A quick example of how they work:


In [2]:
print(stemmer.stem('dictionaries'))
print(lemmer.lemmatize('dictionaries'))


dict
dictionary

So, what approach will be better for the given task? Let's see.

First of all, we need to load modules for linear algebra and data analysis, as well as gensim (for training Word2Vec, a classic algorithm for obtaining word embeddings). We also need some tools from scikit-learn to train and evaluate the classifier, and pyplot to draw plots. seaborn will make the plots prettier.


In [3]:
from gensim import models
import numpy as np
from pandas import DataFrame, Series
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
import matplotlib.pyplot as plt
import seaborn

And a little bit more linguistic tooling! We will use tokenization (breaking a stream of text up into meaningful elements called tokens, for instance, words) and a stop-word list for English.


In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize, RegexpTokenizer
stop = stopwords.words('english')
alpha_tokenizer = RegexpTokenizer(r'[A-Za-z]\w+')
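For illustration, the pattern `[A-Za-z]\w+` keeps only tokens that start with a letter and are at least two characters long. A stdlib-only sketch of the same tokenization plus stop-word filtering (with a tiny hand-picked stop list standing in for NLTK's English list) might look like this:

```python
import re

def tokenize(text):
    """Mimic RegexpTokenizer('[A-Za-z]\\w+'): letter-initial tokens of length >= 2."""
    return re.findall(r'[A-Za-z]\w+', text.lower())

# Tiny hand-picked stop-word set, standing in for stopwords.words('english').
stop_sketch = {'is', 'the', 'a'}

tokens = [t for t in tokenize('Is PS4 the best console?') if t not in stop_sketch]
# punctuation, single characters and stop words are all gone
```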

And let's check that the .csv files with the data are in place.


In [5]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))


test.csv
train.csv

So let's write some code. First of all, let's train a Word2Vec model. We will use the training set as a training corpus (previously I used the test set, but it takes much more memory while a model trained on it has the same efficiency; thanks to @Gian12 for pointing this out). This set contains some NaN values, but we can simply drop them, since their absence does not matter for our task.


In [6]:
df_train = DataFrame.from_csv('../input/train.csv').dropna()

Let's make a list of sentences by merging the two question columns.


In [7]:
texts = np.concatenate([df_train.question1.values, df_train.question2.values])

Okay, now we come to the key method of the preprocessing comparison. It performs lemmatization or stemming depending on the given flags.


In [8]:
def process_sent(words, lemmatize=False, stem=False):
    words = words.lower()
    tokens = alpha_tokenizer.tokenize(words)
    for index, word in enumerate(tokens):
        if lemmatize:
            tokens[index] = lemmer.lemmatize(word)
        elif stem:
            tokens[index] = stemmer.stem(word)
        else:
            tokens[index] = word
    return tokens

Then we can build two different corpora to train the models: a stemmed corpus and a lemmatized corpus. We will also build a "clean" (untouched) corpus for comparison.


In [9]:
corpus_lemmatized = [process_sent(sent, lemmatize=True, stem=False) for sent in texts]

In [10]:
corpus_stemmed = [process_sent(sent, lemmatize=False, stem=True) for sent in texts]

In [11]:
corpus = [process_sent(sent) for sent in texts]

Now let's train the models. I've pre-defined these hyperparameters since the models perform best with them. You can also try to play with them yourself.


In [12]:
VECTOR_SIZE = 100

In [13]:
min_count = 10
size = VECTOR_SIZE
window = 10

In [14]:
model_lemmatized = models.Word2Vec(corpus_lemmatized, min_count=min_count, 
                                   size=size, window=window)

In [15]:
model_stemmed = models.Word2Vec(corpus_stemmed, min_count=min_count, 
                                size=size, window=window)

In [16]:
model = models.Word2Vec(corpus, min_count=min_count, 
                                size=size, window=window)

Let's check the result of one of the models.


In [17]:
model_lemmatized.wv.most_similar('playstation')


Out[17]:
[('ps4', 0.8302322626113892),
 ('ps3', 0.7695804238319397),
 ('console', 0.768163800239563),
 ('xbox', 0.7651442885398865),
 ('pirated', 0.7557218074798584),
 ('gta', 0.7459744215011597),
 ('mod', 0.7215191721916199),
 ('geforce', 0.6910909414291382),
 ('fifa', 0.6814947128295898),
 ('pc', 0.6761049032211304)]

Great! The most similar words seem pretty meaningful. So, we have three trained models and can encode the text data as vectors - let's run some experiments! Let's build data sets from the loaded data frame. I take a chunk of the training data because running the script on the full data takes too much time.


In [18]:
q1 = df_train.question1.values
q2 = df_train.question2.values
Y = np.array(df_train.is_duplicate.values)

A slightly modified preprocessing function: it returns only the words contained in the model's vocabulary.


In [19]:
def preprocess_check(words, lemmatize=False, stem=False):
    words = words.lower()
    tokens = alpha_tokenizer.tokenize(words)
    model_tokens = []
    for index, word in enumerate(tokens):
        if lemmatize:
            lem_word = lemmer.lemmatize(word)
            if lem_word in model_lemmatized.wv.vocab:
                model_tokens.append(lem_word)
        elif stem:
            stem_word = stemmer.stem(word)
            if stem_word in model_stemmed.wv.vocab:
                model_tokens.append(stem_word)
        else:
            if word in model.wv.vocab:
                model_tokens.append(word)
    return model_tokens

This function obtains a bag of means by vectorizing the messages.


In [20]:
old_err_state = np.seterr(all='raise')

def vectorize(words, words_2, model, num_features, lemmatize=False, stem=False):
    features = np.zeros((num_features), dtype='float32')
    words_amount = 0

    words = preprocess_check(words, lemmatize, stem)
    words_2 = preprocess_check(words_2, lemmatize, stem)
    for word in words:
        words_amount = words_amount + 1
        features = np.add(features, model.wv[word])
    for word in words_2:
        words_amount = words_amount + 1
        features = np.add(features, model.wv[word])
    try:
        features = np.divide(features, words_amount)
    except FloatingPointError:
        # neither question had any in-vocabulary words: fall back to a zero vector
        features = np.zeros(num_features, dtype='float32')
    return features
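To see why the try/except is needed: with `np.seterr(all='raise')`, dividing by a `words_amount` of zero (no in-vocabulary words in either question) raises `FloatingPointError`, and the pair falls back to a zero vector instead of becoming NaNs. A minimal reproduction of that path:

```python
import numpy as np

np.seterr(all='raise')  # make invalid float operations raise, as in the notebook

num_features = 4
features = np.zeros(num_features, dtype='float32')
words_amount = 0  # pretend no known words were found in either question

try:
    # 0/0 is an invalid operation, so this raises FloatingPointError
    features = np.divide(features, words_amount)
except FloatingPointError:
    features = np.zeros(num_features, dtype='float32')
```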

And now we can compute the feature matrices.


In [21]:
X_lem = []
for index, sentence in enumerate(q1):
    X_lem.append(vectorize(sentence, q2[index], model_lemmatized, VECTOR_SIZE, True, False))
X_lem = np.array(X_lem)

In [22]:
X_stem = []
for index, sentence in enumerate(q1):
    X_stem.append(vectorize(sentence, q2[index], model_stemmed, VECTOR_SIZE, False, True))
X_stem = np.array(X_stem)

In [23]:
X = []
for index, sentence in enumerate(q1):
    X.append(vectorize(sentence, q2[index], model, VECTOR_SIZE))
X = np.array(X)

That's almost all! Now we can train the classifier and evaluate its performance. It's better to use a metric classifier because we are performing operations in a vector space, so I choose Logistic Regression. But of course you can try something different and see what changes.

I also use cross-validation to train and evaluate on the same data set.
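As a sketch of what this cross-validation scheme does: ShuffleSplit with these parameters draws 10 independent random 90/10 train/test partitions of the row indices (toy data below, same parameters as in the cells that follow):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Same parameters as used with learning_curve below, applied to toy data.
cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
X_toy = np.arange(100).reshape(50, 2)  # 50 rows of 2 features

splits = list(cv.split(X_toy))
n_test = len(splits[0][1])  # 10% of 50 rows held out in each split
```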


In [24]:
estimator = LogisticRegression(C = 1)
cv = ShuffleSplit(n_splits = 10, test_size=0.1, random_state=0)
train_sizes = np.linspace(0.1, 0.9, 10)
train_sizes, train_scores = learning_curve(estimator, X_lem, Y, cv=cv, train_sizes=train_sizes)
train_scores_lem = np.mean(train_scores, axis=1)

In [ ]:
estimator = LogisticRegression(C = 1)
cv = ShuffleSplit(n_splits = 10, test_size=0.1, random_state=0)
train_sizes = np.linspace(0.1, 0.9, 10)
train_sizes, train_scores = learning_curve(estimator, X_stem, Y, cv=cv, train_sizes=train_sizes)
train_scores_stem = np.mean(train_scores, axis=1)

In [ ]:
estimator = LogisticRegression(C = 1)
cv = ShuffleSplit(n_splits = 10, test_size=0.1, random_state=0)
train_sizes = np.linspace(0.1, 0.9, 10)
train_sizes, train_scores = learning_curve(estimator, X, Y, cv=cv, train_sizes=train_sizes)
train_scores = np.mean(train_scores, axis=1)

In [ ]:
title_font = {'size':'10', 'color':'black', 'weight':'normal',
                  'verticalalignment':'bottom'} 
axis_font = {'size':'10'}

plt.figure(figsize=(10, 5))
plt.xlabel('Training examples', **axis_font)
plt.ylabel('Accuracy',  **axis_font)
plt.tick_params(labelsize=10)

plt.plot(train_sizes, train_scores_lem, label='Lemmatization', linewidth=5)
plt.plot(train_sizes, train_scores_stem, label='Stemming', linewidth=5)
plt.plot(train_sizes, train_scores, label='Default', linewidth=5)
  
plt.legend(loc='best')
plt.show()

So, the lemmatized model outperformed the "clean" model! And the stemmed model showed the worst result. Why does this happen?

Well, any morphological pre-processing of the model's training data reduces the amount of information the model can obtain from the corpus. Some of that information, like the differences between inflected forms of the same word, is unnecessary, so it is better to remove it. This removal is a must-have in synthetic languages (languages with a high morpheme-per-word ratio, like Russian), and, as we can see, it is also quite helpful in our task.

The same goes for stemming. Stemming reduces the amount of information even further, producing one stem for different word forms. Sometimes this is helpful, but sometimes it brings noise into the model, since the stems of different words can collide, and the model may be unable to separate "playstation" and, say, "play".
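A toy illustration of such collisions (with a hypothetical stem table, not the actual Lancaster output): once several distinct words share one stem, all their occurrences end up training a single vector.

```python
# Hypothetical stem table: an aggressive stemmer maps distinct words to one stem.
stems = {'play': 'play', 'plays': 'play', 'player': 'play', 'playing': 'play'}

def collisions(words, stems):
    """Group words by stem and keep only stems shared by several words."""
    groups = {}
    for word in words:
        groups.setdefault(stems[word], []).append(word)
    return {stem: ws for stem, ws in groups.items() if len(ws) > 1}

merged = collisions(['play', 'player', 'playing'], stems)
# all three words collapse onto the single stem 'play'
```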

In other words, there is no silver bullet, and you should always check various pre-processing options if you want to reach the best performance. However, nine times out of ten lemmatization will improve the performance of your model.

Admittedly, the logarithmic loss of my approach is not very impressive, but you can use this notebook as a baseline and try to beat its score yourself! Just download it and uncomment the commented lines (they are commented out because Kaggle doesn't allow using that much memory).


In [ ]:
clf = LogisticRegression(C = 1)
clf.fit(X, Y)

#df_test = DataFrame.from_csv('../input/test.csv').fillna('None')
q1 = df_train.question1.values[:100]
q2 = df_train.question2.values[:100]
#q1 = df_test.question1.values
#q2 = df_test.question2.values

X_test = []
for index, sentence in enumerate(q1):
    X_test.append(vectorize(sentence, q2[index], model, VECTOR_SIZE))
X_test = np.array(X_test)

result = clf.predict(X_test)

sub = DataFrame()
sub['is_duplicate'] = result
sub.to_csv('submission.csv', index=False)

Thanks for reading this notebook. I hope it helped you learn something new.

I will highly appreciate any critique or feedback. Feel free to share your thoughts in the comments section!